Objectives:
https://blogs.bmj.com/bmj/2020/07/03/covid-19-has-highlighted-the-inadequate-and-unequal-access-to-high-quality-green-spaces/ https://www.weforum.org/agenda/2020/08/parks-green-spaces-mental-health-access-equality/
Scale of analysis: MSOA (2 of the 3 datasets at this granularity) Geographic scope: England (case data at MSOA level available for England only)
The ONS (Office for National Statistics) provides data on access to private green space (i.e. access to gardens) for each MSOA in Great Britain. Here I am using the most recent April 2020 edition of the data. I quickly, manually edited the ONS excel file to make it easier use the read_excel function for data import. Given it is unlikely that the ONS data will be updated during the course of this analysis, it was preferable to go for the quicker manual process than investing time in a re-producible programmatic approach.
private_green_space <- read_excel("./Data/osprivateoutdoorspacereferencetables.xlsx", sheet = "LAD gardens_2")
private_green_space %>%
sample_n_from_df(10)
The ONS (Office for National Statistics) also provides data on access to public green space (i.e. access to parks and playing fields) for each Lower Super Output Area (LSOA) in Great Britain. In this case no manual editing of the ONS excel file was required.
It should be noted that this access to public green space data is provided at a finer geographic resolution than the other datasets used in this analysis, which are provided at MSOA granularity. Each MSOA is broken down in a number of LSOAs, so later on it will be straight forward (with the help of a lookup table) to aggregate this data to enable comparision with the other data sources.
parks <- read_excel("./Data/ospublicgreenspacereferencetables.xlsx", sheet = "LAD Parks only")
parks %>%
sample_n_from_df(10)
data.gov.uk provides data on the number of Covid-19 cases (as confirmed by positive tests) occurring in each English MSOA. This data is provided in the form of weekly case numbers. A value of -99 is used if in a given week, zero, one or two cases are reported in a MSOA. This is presumably to reduce the risk of individual cases reported within the data being identified.
Below I import the data and check how NA values are used within the dataset, as no details are provided in the limited documentation provided by data.gov.uk. No NA are present for variables reporting the number of cases occurring. This means I can replace the -99 values with NA, making it easier to work with the case data in the remainder of this notebook.
cases <- read_csv("./Data/MSOAs_latest.csv")
# Check how NA values used in the case data
na_count <- cases %>%
map(~ sum(is.na(.)))
# reimport case data, replacing -99 in case columns with NA and defining column types
# (where readr initial guesses have been unsuccessful)
cases <- read_csv("./Data/MSOAs_latest.csv",
na = c(-99, "NA"),
col_types = cols(wk_05 = "d",
wk_06 = "d",
wk_07 = "d",
wk_08 = "d",
wk_09 = "d")
)
cases %>%
sample_n_from_df(10)
The data.gov.uk case data records cases across multiple columns. Each of the columns corresponds to a numbered week (starting from week 5 in 2020). So, below I transform the case data so it is in the tidy format expected by tidyverse tools (e.g. ggplot2).
# select all column names for variables reporting case data
case_colnames <- colnames(cases)[str_detect(colnames(cases),"^wk_")]
# find the first and last columns containing weekly cases
first_case_col <- case_colnames[1]
last_case_col <- case_colnames[length(case_colnames)]
# Tidy data and process dates
cases_tidy <- cases %>%
pivot_longer(cols = all_of(first_case_col):all_of(last_case_col),
names_to = "week", values_to = "cases") %>%
# convert week numbers to w/c dates
mutate(week = as.integer(str_sub(week, 4, 5))) %>%
mutate(week_commencing = lubridate::ymd( "2019-12-30" ) + lubridate::weeks(week - 1))
cases_tidy
As noted above, some geographic aggregation of the access to public green space data will be required. Specifically, aggregation from the LSOA to MSOA level. A look-up table for this mapping from LSOAs to MSOAs is available for the ONS and is imported below.
## Parsed with column specification:
## cols(
## OA11CD = col_character(),
## OAC11CD = col_character(),
## OAC11NM = col_character(),
## LSOA11CD = col_character(),
## LSOA11NM = col_character(),
## SOAC11CD = col_character(),
## SOAC11NM = col_character(),
## MSOA11CD = col_character(),
## MSOA11NM = col_character(),
## LAD17CD = col_character(),
## LAD17NM = col_character(),
## LACCD = col_character(),
## LACNM = col_character(),
## RGN11CD = col_character(),
## RGN11NM = col_character(),
## CTRY11CD = col_character(),
## CTRY11NM = col_character(),
## FID = col_double()
## )
populations <- read_csv("./Data/ons_mid_2019_population_estimates.csv")
## Parsed with column specification:
## cols(
## .default = col_number(),
## Code = col_character(),
## Name = col_character(),
## Geography1 = col_character()
## )
## See spec(...) for full column specifications.
populations
two sub groups (Flats and Houses) within the data, at this stage interested in both (also provided)
priv_gs_focus <- private_green_space %>%
select(`LAD code`, `LAD name`, `T_Address count`,
`T_Percentage of adresses with private outdoor space`,
`T_Average size of private outdoor space (m2)`
) %>%
rename(address_count = `T_Address count`,
perc_garden = `T_Percentage of adresses with private outdoor space`,
ave_garden_size = `T_Average size of private outdoor space (m2)`)
priv_gs_focus
There 15 NA values present across all columns in the priv_gs_focus dataframe. Given the number of NAs is small relative to the size of the dataframe (374, 5), and this is an exploratory data analysis I’ll leave NAs to be handled/removed by the plotting functions used below.
How to plot boxplot and histogram and align
format_hist_box_pair_plot <- function(x_lims){
list(scale_x_continuous(limits= x_lims),
theme_minimal()
)
}
hist_box_pair_plot <- function(df, x_var, x_lims, x_label, y_label,
colour = "grey40", binwidth = 0.01, ...){
histogram <- ggplot(data = df,
mapping = aes_string(x = x_var))
histogram <- histogram + geom_histogram(fill = colour, binwidth = binwidth) +
labs(x = NULL, y = y_label) +
format_hist_box_pair_plot(x_lims)
box_plot <- ggplot(data = df,
mapping = aes_string(x = x_var))
box_plot <- box_plot + geom_boxplot(colour = colour) +
scale_y_continuous(labels = NULL) +
labs(x = x_label) +
format_hist_box_pair_plot(x_lims)
#combine and align the histogram and box plot
cowplot::plot_grid(histogram, box_plot,
ncol = 1, rel_heights = c(3, 1),
align = 'v', axis = 'lr')
}
hist_box_pair_plot(priv_gs_focus, "perc_garden",
x_label = "Percentage of adresses with private outdoor space",
y_label = "number of Local Authority Districts",
x_lims = c(0, 1)
)
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
On how to pass variable names to function which wrapper around dplyr function calls https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
var_summary_stats <- function(column, na.rm = TRUE) {
# the argument column is expected to be numeric vector, as this function was designed
# to work with purrr::map (which passes dataframe columns as vectors to the function
# being mapped)
# return a single row dataframe containing the summary stats for column
tibble(
min = min(column, na.rm = na.rm),
q1 = quantile(column, probs = 0.25, na.rm = na.rm),
median = median(column, na.rm = na.rm),
q3 = quantile(column, probs = 0.75, na.rm = na.rm),
max = max(column, na.rm = na.rm),
iqr = IQR(column, na.rm = na.rm),
mean = mean(column, na.rm = na.rm),
sd = sd(column, na.rm = na.rm)
)
}
df_summary_stats <- function(df){
# get the names of the numeric variables in the dataframe
variable_name <- df %>%
select_if(is.numeric) %>%
names()
# return a dataframe containing summary stats for each numeric variable in df
df %>%
select_if(is.numeric) %>%
purrr::map_dfr(var_summary_stats) %>%
add_column(variable_name, .before = 1)
}
df_summary_stats(priv_gs_focus)
hist_box_pair_plot(priv_gs_focus, "ave_garden_size",
x_label = "Average size private outdoor space (m2)",
y_label = "number of Local Authority Districts",
x_lims = c(0, 2000),
binwidth = 50
)
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
p <- ggplot(data = priv_gs_focus,
mapping = aes(x = log10(ave_garden_size))
)
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
count_NA_in_df = function(df){
df %>%
map(~ sum(is.na(.x))) %>%
as_tibble()
}
parks
count_NA_in_df(parks)
cases_focus <- cases_tidy %>%
select(msoa11_cd, msoa11_hclnm, week_commencing, cases) %>%
arrange(msoa11_cd, week_commencing) %>%
group_by(msoa11_cd) %>%
mutate(culm_cases = cumsum(ifelse(is.na(cases), 0, cases)) )
cases_focus
count_NA_in_df(cases_focus)
calculating infections per 100,000
#process geog lookup table
msoa_to_lad <- geog_lookup %>%
select(MSOA11CD, MSOA11NM, LAD17CD, LAD17NM) %>%
# remove duplicate entries of each combination
unique()
# process populations table
populations_minimal <- populations %>%
select(Code, Name, Geography1, `All ages`)
# join cases with geog lookup
cases_lad <- cases_focus %>%
left_join(msoa_to_lad, by = c("msoa11_cd" = "MSOA11CD")) %>%
group_by(LAD17CD, LAD17NM) %>%
summarise(total_cases = sum(cases, na.rm = TRUE))
## `summarise()` regrouping output by 'LAD17CD' (override with `.groups` argument)
# then join with population and calculate positive tests per 100,000 people
cases_lad_pop <- cases_lad %>%
left_join(populations_minimal, by = c("LAD17CD" = "Code")) %>%
mutate(cases_per_100000_pop = (total_cases / `All ages`) * 1e+5)
# then join with private green space data
cases_lad_pop_gardens <- cases_lad_pop %>%
left_join(priv_gs_focus, by = c("LAD17CD" = "LAD code"))
# then join with parks data
cases_lad_pop_gardens_parks <- cases_lad_pop_gardens %>%
left_join(parks, by = c("LAD17CD" = "LAD code"))
# remove some redundant columns
analysis_df <- cases_lad_pop_gardens_parks %>%
select(-`LAD name.y`, - `LAD name.x`, - Name)
analysis_df
ONS data on access to parks is available at LSOA and LAD (Local Authority District) scales, whereas data.gov.uk Covid-19 case data is provided at the MSOA scale. Each MSOA is made up of a number of LSOAs, but it not clear how best average the variables detailing access to green space over multiple LSOAs (given that LSOAs can be different sizes in terms of population and geography). So, making the case and green space datasets comparable at the MSOA scale is challenging. However, each LAD is made up of multiple MSOAs, and it is more straight forward to aggregate Covid-19 case numbers from the MSOA scale to LAD scale (simply by addition). Case numbers can then be compared to the access to green space dataset, where LAD scale statistics are provided by the ONS. Presumably the ONS has taken an appropriate approach to translating the variables at different scale, an approach which is likely to be more sophisticated and better informed than an approach I might devise and implement.
format_park_cases_plot <- function(){
list(geom_point(alpha = 0.4),
geom_smooth(colour = "grey50", fill = "grey90"),
theme_minimal()
)
}
p <- ggplot(data = analysis_df,
mapping = aes(x = `Average distance to nearest Park (m)`,
y = log(cases_per_100000_pop))
)
p + format_park_cases_plot()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 27 rows containing non-finite values (stat_smooth).
## Warning: Removed 27 rows containing missing values (geom_point).
p <- ggplot(data = analysis_df,
mapping = aes(x = log(`Average size of nearest Park (m2)`),
y = log(cases_per_100000_pop))
)
p + format_park_cases_plot()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 27 rows containing non-finite values (stat_smooth).
## Warning: Removed 27 rows containing missing values (geom_point).
p <- ggplot(data = analysis_df,
mapping = aes(x = `Average number of Parks within one km`,
y = log(cases_per_100000_pop))
)
p + format_park_cases_plot()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 27 rows containing non-finite values (stat_smooth).
## Warning: Removed 27 rows containing missing values (geom_point).
p <- ggplot(data = analysis_df,
mapping = aes(x = log(`Average combined size of Parks within 1km (m2)`),
y = log(total_cases))
)
p + format_park_cases_plot()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 23 rows containing non-finite values (stat_smooth).
## Warning: Removed 23 rows containing missing values (geom_point).
p <- ggplot(data = analysis_df,
mapping = aes(x = perc_garden,
y = log(total_cases))
)
p + geom_jitter(alpha = 0.3) +
scale_x_continuous(limits = c(0.5, 1))
## Warning: Removed 24 rows containing missing values (geom_point).
scale_y_continuous(limits = c(0, 1500))
## <ScaleContinuousPosition>
## Range:
## Limits: 0 -- 1.5e+03
p <- ggplot(data = analysis_df,
mapping = aes(x = ave_garden_size,
y = log(total_cases))
)
p + geom_jitter(alpha = 0.3)
## Warning: Removed 23 rows containing missing values (geom_point).
var_summary <- function(df, var, na.rm = TRUE) {
summary <- df %>%
summarise(min = min({{var}}, na.rm = na.rm),
q1 = quantile({{var}}, probs = 0.25, na.rm = na.rm),
median = median({{var}}, na.rm = na.rm),
q3 = quantile({{var}}, probs = 0.75, na.rm = na.rm),
max = max({{var}}, na.rm = na.rm),
iqr = IQR({{var}}, na.rm = na.rm),
mean = mean({{var}}, na.rm = na.rm),
sd = sd({{var}}, na.rm = na.rm)
)
summary[1, "variable"] <- toString(deparse(substitute(var)))
summary %>%
relocate(variable, .before = min)
}
var_summary(priv_gs_focus, perc_garden)